Exercise 16: Simple regression

This exercise uses the data set BLDPRES.SAV. We focus on the relation between age and systolic blood pressure. We will first look at the women, and then at the men.

Women

    First we look at the women in our dataset.
  1. If not already done, start RStudio. Open the file BLDPRES.sav. Since this is a .sav file, we need to import it using a function from the foreign package: read.spss().

  2. Select the subset of women in the dataset and use this subset to make a scatterplot of sys against age. Fit the regression line with sys as dependent variable and age as independent variable, and draw it over the scatterplot.

  3. Summarize the regression and interpret the results: 1) Give the equation of the regression line. 2) Give the estimate of the increase in mean systolic blood pressure per year of age, with the corresponding 95% confidence interval. How is the confidence interval computed? 3) How should the R2 be interpreted?

  4. Add the confidence bands around the regression line on top of the plot from part b. Note that the confidence interval is narrowest at the mean value of age (39.1).

  5. Plot the diagnostics of the regression and interpret the results.

    Hints

      ###################################################
      #  part a                                         #
      ###################################################
      library(foreign) # load the package
      df <- read.spss(file = ..., to.data.frame = TRUE) # import your own datafile
      
      ###################################################
      #  part b                                         #
      ###################################################
      df.w <- df[df$sex == "female", ]
      fit <- lm(... ~ ..., data = df.w)
      plot(y = ..., x = ...)
      abline(..., col = "red")
      
      ###################################################
      #  part c                                         #
      ###################################################
      summary(...)
      
      ###################################################
      #  part d                                         #
      ###################################################
      library(ggplot2)
      ggplot(df.w, aes(y = ..., x = ...)) + geom_smooth(method = "lm")
      
      ###################################################
      #  part e                                         #
      ###################################################
      plot(...)

      Code

      ###################################################
      #  part a                                         #
      ###################################################
      library(foreign) # load the package
      df <- read.spss(file = "./data/BLDPRES.sav", to.data.frame = TRUE) # import your own datafile
      
      ###################################################
      #  part b                                         #
      ###################################################
      df.w <- df[df$sex == "female", ]
      fit.w <- lm(sys ~ age, data = df.w)
      plot(y = df.w$sys, x = df.w$age)
      abline(fit.w, col = "red")
      
      ###################################################
      #  part c                                         #
      ###################################################
      summary(fit.w)
      
      ###################################################
      #  part d                                         #
      ###################################################
      library(ggplot2)
      ggplot(df.w, aes(y = sys, x = age)) + geom_smooth(method = "lm")
      
      ###################################################
      #  part e                                         #
      ###################################################
      plot(fit.w)

      Output

      ## 
      ## Call:
      ## lm(formula = sys ~ age, data = df.w)
      ## 
      ## Residuals:
      ##     Min      1Q  Median      3Q     Max 
      ## -35.639  -9.642  -1.144   9.356  51.864 
      ## 
      ## Coefficients:
      ##              Estimate Std. Error t value Pr(>|t|)    
      ## (Intercept) 105.64871    3.23658  32.642  < 2e-16 ***
      ## age           0.49983    0.07688   6.501 7.52e-10 ***
      ## ---
      ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      ## 
      ## Residual standard error: 16.23 on 181 degrees of freedom
      ## Multiple R-squared:  0.1893, Adjusted R-squared:  0.1848 
      ## F-statistic: 42.26 on 1 and 181 DF,  p-value: 7.52e-10

      Discussion

      The regression is significant (p < 0.001). The regression equation is: sys = 105.6 + 0.50 x age. Per year increase in age, the mean systolic blood pressure increases by 0.50 mmHg. The 95% CI is computed as estimate ± 1.96 x standard error: 0.50 ± 1.96 x 0.077, i.e. 0.35 to 0.65. The R2 should be interpreted as the percentage of variability in systolic blood pressure explained by the model (intercept and age), which is 18.9% in this case. The first diagnostic plot shows the residuals plotted against the fitted values. The residuals are distributed around zero, with roughly equal standard deviation for each fitted value (homoscedasticity). The second plot (Q-Q plot) is better suited to assess the normality assumption, which holds for this regression. The other two plots show the leverage of each individual patient on the fitted model. There are no clear outliers.
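      The confidence interval for the age slope can also be obtained directly in R. A minimal sketch, assuming the fit.w object fitted in the code above:

      ```r
      confint(fit.w)                        # exact 95% CI, based on the t-distribution
      est <- coef(summary(fit.w))["age", ]  # slope estimate and its standard error
      est["Estimate"] + c(-1, 1) * 1.96 * est["Std. Error"]  # normal approximation: 0.35 to 0.65
      ```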

Men

    Next, we will investigate the subset of men.
  1. Repeat the analysis of the women in the men: fit the model in this subgroup, make a scatterplot and show the diagnostic plots. Compare the models for men and women.

  2. Make a scatterplot for the whole group, and draw the lines for men and women both on the plot.

    Hints

      ###################################################
      #  part b                                         #
      ###################################################
      plot(...)
      abline(fit.w, col = "red")
      abline(fit.m, col = "green")
      legend("topleft", legend = c("...", "..."), col = c("...", "..."), lty = 1)

      Code

      ###################################################
      #  part a                                         #
      ###################################################
      df.m <- df[df$sex == "male", ]
      fit.m <- lm(sys ~ age, data = df.m)
      plot(y = df.m$sys, x = df.m$age)
      abline(fit.m, col = "red")
      summary(fit.m)
      plot(fit.m)
      
      ###################################################
      #  part b                                         #
      ###################################################
      plot(y = df$sys, x = df$age)
      abline(fit.w, col = "red")
      abline(fit.m, col = "green")
      legend("topleft", legend = c("Female", "Male"), col = c("red", "green"), lty = 1)

      Output

      ## 
      ## Call:
      ## lm(formula = sys ~ age, data = df.m)
      ## 
      ## Residuals:
      ##     Min      1Q  Median      3Q     Max 
      ## -46.453 -10.832  -1.377   8.435  49.901 
      ## 
      ## Coefficients:
      ##              Estimate Std. Error t value Pr(>|t|)    
      ## (Intercept) 115.56903    3.93895  29.340   <2e-16 ***
      ## age           0.24436    0.09444   2.587   0.0107 *  
      ## ---
      ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      ## 
      ## Residual standard error: 16.33 on 143 degrees of freedom
      ## Multiple R-squared:  0.04472,    Adjusted R-squared:  0.03804 
      ## F-statistic: 6.695 on 1 and 143 DF,  p-value: 0.01067

      Discussion

      It is observed that the slope for men (0.24) is lower than the slope for women (0.50). For men too, the normality and homoscedasticity assumptions are fulfilled. In the scatterplot for both sexes, it is observed that the lines cross.
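      A formal way to check whether the slopes for men and women differ is an interaction model on the full dataset. This goes beyond the exercise; a sketch, assuming the df data frame from the code above:

      ```r
      fit.int <- lm(sys ~ age * sex, data = df)  # main effects plus age:sex interaction
      summary(fit.int)  # the age:sex coefficient estimates the difference in slopes
      ```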

Exercise 17: Multiple regression I

This exercise concerns the data set BLDPRES.SAV. We want to study with multiple regression in R the joint relationship of age, body weight and pulse rate with the diastolic blood pressure.

Exploring predictors

    First we look at the simple relationships between each of the predictor variables and the diastolic pressure. Because the same analysis has to be done three times, it is useful to copy and paste the commands. The procedure is the following:
  1. If not already done, start RStudio. Open the file BLDPRES.sav. Since this is a .sav file, we need to import it using a function from the foreign package: read.spss().

  2. Specify a scatterplot with dias as Y variable and age as X variable. Do a linear regression with dias as dependent and age as independent variable. Add the regression line over the scatterplot.

  3. Now repeat the steps of b, but with weight and pulse as independent variables instead of age, and make similar scatterplots.

  4. Give the regression coefficients and their significance. What is the interpretation of these regression coefficients? What percentage of variability in the diastolic blood pressure is explained by each predictor separately?

Hints

###################################################
#  part a                                         #
###################################################
library(foreign)
read.spss(..., to.data.frame = TRUE) 

###################################################
#  part b and c                                   #
###################################################
fit <- lm(formula = ..., data = df) # do a linear regression 
plot(...) # plot the scatterplot
abline(fit, col = "red") # add the regression line over the plot

###################################################
#  part d                                         #
###################################################
summary(fit) 

Code

###################################################
#  part a                                         #
###################################################
library(foreign) # load the package
df <- read.spss(file = "./Data/BLDPRES.SAV", to.data.frame = TRUE) # import your own datafile

###################################################
#  part b                                         #
###################################################
fit.age <- lm(dias ~ age, data = df) # do a linear regression of age on dias
plot(y = df$dias, x = df$age, ylab = "Diastolic blood pressure", xlab = "Age in years", main = "Dias versus age") # plot the scatterplot
abline(fit.age, col = "red") # add the regression line over the plot

###################################################
#  part c                                         #
###################################################
# with weight as independent variable
fit.weight <- lm(dias ~ weight, data = df) # do a linear regression of weight on dias
plot(y = df$dias, x = df$weight, ylab = "Diastolic blood pressure", xlab = "Weight", main = "Dias versus weight") # plot the scatterplot
abline(fit.weight, col = "red") # add the regression line over the plot
# with pulse as independent variable
fit.pulse <- lm(dias ~ pulse, data = df) # do a linear regression of pulse on dias
plot(y = df$dias, x = df$pulse, ylab = "Diastolic blood pressure", xlab = "Pulse", main = "Dias versus pulse") # plot the scatterplot
abline(fit.pulse, col = "red") # add the regression line over the plot

###################################################
#  part d                                         #
###################################################
summary(fit.age)
summary(fit.weight)
summary(fit.pulse)

Output

## 
## Call:
## lm(formula = dias ~ age, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.2944  -7.2955  -0.1049   6.4850  27.6855 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 77.99622    1.55712  50.090   <2e-16 ***
## age          0.07997    0.03714   2.153    0.032 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.13 on 326 degrees of freedom
## Multiple R-squared:  0.01402,    Adjusted R-squared:  0.011 
## F-statistic: 4.636 on 1 and 326 DF,  p-value: 0.03204
## 
## Call:
## lm(formula = dias ~ weight, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.2729  -6.6191   0.2111   5.6101  30.8809 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 66.99104    2.42989  27.570  < 2e-16 ***
## weight       0.27564    0.04622   5.963 6.44e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.69 on 326 degrees of freedom
## Multiple R-squared:  0.09835,    Adjusted R-squared:  0.09558 
## F-statistic: 35.56 on 1 and 326 DF,  p-value: 6.437e-09
## 
## Call:
## lm(formula = dias ~ pulse, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.8296  -6.8272  -0.1902   6.1872  28.6704 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 72.04171    3.96098  18.188   <2e-16 ***
## pulse        0.11057    0.04773   2.316   0.0212 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.12 on 326 degrees of freedom
## Multiple R-squared:  0.01619,    Adjusted R-squared:  0.01317 
## F-statistic: 5.366 on 1 and 326 DF,  p-value: 0.02116

Discussion

We can see that the coefficients of age, weight and pulse are 0.08, 0.28 and 0.11, respectively. The Wald tests show p-values < 0.05 for all three coefficients. The interpretation is that a one-unit increase in the predictor is associated with an increase in diastolic blood pressure equal to the coefficient. For example: one year increase in age is associated with a 0.08 mmHg increase in diastolic blood pressure. The percentage of explained variability in the diastolic blood pressure is 1.4%, 9.8% and 1.6% for age, weight and pulse, respectively. This means that of the three investigated variables, weight explains the most variability in diastolic blood pressure.
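The coefficients and R-squared values quoted above can also be collected programmatically instead of read off three separate summaries. A sketch, assuming the fit.age, fit.weight and fit.pulse objects from the code above:

```r
fits <- list(age = fit.age, weight = fit.weight, pulse = fit.pulse)
sapply(fits, function(f) c(slope     = unname(coef(f)[2]),      # regression coefficient
                           r.squared = summary(f)$r.squared))   # explained variability
```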

Correlation of predictors

    After the initial exploration of the predictors, we continue by checking the assumption that the predictors are not collinear.
  1. Obtain the correlation coefficients between the predictor variables and examine these.

Hints

plot(...) # inserting a data.frame with the columns of interest will plot all pairwise scatterplots
corr <- cor(x =... , # insert the same data.frame
          use = "pairwise.complete.obs", method = "...") # calculate correlation coefficients. Spearman or Pearson?
install.packages("corrplot") # the first time 
corrplot::corrplot(corr, method = "square", type ="upper") # display graphically the correlation between variables, from the corrplot package

Code

df.cor <- df[, c("pulse", "weight", "age")]
plot(df.cor)
corr <- cor(x = df.cor, use = "pairwise.complete.obs", method = "pearson")
corrplot::corrplot(corr, method = "square", type ="upper")

Output

Discussion

We can observe that there is little correlation between the variables. Visual inspection of the scatterplots suggested that the pairwise distributions are approximately bivariate normal, so we used the Pearson correlation coefficient to inspect correlation. Since the correlations are low, we can include all three variables in the model.

Multivariable regression

    Now we will continue by fitting a multiple regression model with dias as dependent variable and age, weight and pulse as independent variables.
  1. Fit this model.
  2. Summarize the results and retrieve the coefficients and their significance. Explain what the R-squared, adjusted R-squared, F-statistic and p-value in the summary mean.
  3. Plot the diagnostic plots of this model and explain the output.

Hints

lm(dias ~ ... + ... + ..., data = df) #For multiple regression, use the + sign between predictors

Code

###################################################
#  part a                                         #
###################################################
fit.mult <- lm(dias ~ age + weight + pulse, data = df)

###################################################
#  part b                                         #
###################################################
summary(fit.mult)

###################################################
#  part c                                         #
###################################################
plot(fit.mult)

Output

## 
## Call:
## lm(formula = dias ~ age + weight + pulse, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.1432  -6.1882  -0.0231   5.4479  28.8880 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 54.49560    4.60553  11.833  < 2e-16 ***
## age          0.08023    0.03510   2.286   0.0229 *  
## weight       0.28194    0.04556   6.188 1.84e-09 ***
## pulse        0.10996    0.04516   2.435   0.0154 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.544 on 324 degrees of freedom
## Multiple R-squared:  0.1308, Adjusted R-squared:  0.1227 
## F-statistic: 16.25 on 3 and 324 DF,  p-value: 7.332e-10

Discussion (1)

All three variables have a significant effect on diastolic blood pressure. The interpretation is as follows: conditional on weight and pulse ("keeping these variables constant"), a one-year increase in age is associated with a 0.08 mmHg increase in diastolic blood pressure. The interpretation of the other coefficients is similar. The R-squared should be interpreted as the fraction of variance explained by the model, which is 0.13. The adjusted R-squared corrects for the number of degrees of freedom spent in the analysis; since it is only slightly lower than the ordinary R-squared, few degrees of freedom were spent for the explained variance. The highest R-squared of the univariable analyses was that of weight: 0.098. The three variables combined explain more of the variance in diastolic blood pressure than each of them separately. The F-test is the ANOVA test with H0: beta(age) = beta(weight) = beta(pulse) = 0. The value of 16.25 on 3 and 324 df gives a p-value < 0.001, so the model as a whole significantly explains diastolic blood pressure.
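The overall F-test in the summary is equivalent to comparing the fitted model against an intercept-only model with anova(). A sketch, assuming fit.mult and df from the code above:

```r
fit.null <- lm(dias ~ 1, data = df)  # intercept-only (null) model
anova(fit.null, fit.mult)            # reproduces the overall F-statistic of the summary
```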

Discussion (2)

The first plot shows the residuals of diastolic blood pressure plotted versus the fitted values of the model. We look for normality along the y-axis, because we assumed that the residuals are normally distributed for each value of x. We also do not observe heteroscedasticity in the residuals: the variation around the mean is equal across the fitted values. The second plot (Q-Q plot) shows how the standardized residuals compare to what they would be if they were theoretically normally distributed. Since the dots follow the diagonal, the residuals are normally distributed. The third plot is similar to the first, but plots the square root of the standardized residuals against the fitted values; the information it gives is similar to the first plot. The fourth plot indicates whether outliers with high leverage are distorting the model. This does not seem to be the case.
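By default, plot() on an lm object shows the four diagnostic plots one after another; they can also be placed on a single page. A sketch, assuming fit.mult from the code above:

```r
par(mfrow = c(2, 2))  # 2 x 2 grid of plots
plot(fit.mult)        # residuals vs fitted, Q-Q, scale-location, residuals vs leverage
par(mfrow = c(1, 1))  # restore the default layout
```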

Exercise 18: Multiple regression II

This exercise concerns the data of the 25 cystic fibrosis patients given in the file CYSTFIBR.SAV. We want to predict with multiple regression the residual volume, RV, on the basis of body height and weight.

Univariable exploration

    First we look at the simple relationships between each of the predictor variables and the residual volume (RV). Because the same analysis has to be done twice, it is useful to copy and paste the commands. The procedure is the following:
  1. If not already done, start RStudio. Open the file CYSTFIBR.SAV. Since this is a .sav file, we need to import this using a function from the foreign package: read.spss().

  2. Fit the simple regression models of RV based on HEIGHT and next RV on WEIGHT. Examine the estimated regression coefficients and their significance.

Hints

###################################################
#  part a                                         #
###################################################
library(foreign)
read.spss(..., to.data.frame = TRUE) 

###################################################
#  part b                                         #
###################################################
lm(... ~ ..., data = ...) # to fit a linear model
summary(...)              # to obtain summary statistics

Code

###################################################
#  part a                                         #
###################################################
library(foreign) # load the package
df <- read.spss(file = "./data/CYSTFIBR.SAV", to.data.frame = TRUE) # import your own datafile

###################################################
#  part b                                         #
###################################################
fit.height <- lm(RV ~ HEIGHT, data = df)
summary(fit.height)

fit.weight <- lm(RV ~ WEIGHT, data = df)
summary(fit.weight)

Output

## 
## Call:
## lm(formula = RV ~ HEIGHT, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -111.99  -41.91  -14.34   29.50  162.81 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 603.3592   105.7678   5.705 8.27e-06 ***
## HEIGHT       -2.2785     0.6857  -3.323  0.00296 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 72.22 on 23 degrees of freedom
## Multiple R-squared:  0.3244, Adjusted R-squared:  0.295 
## F-statistic: 11.04 on 1 and 23 DF,  p-value: 0.002962
## 
## Call:
## lm(formula = RV ~ WEIGHT, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -103.691  -58.980   -2.396   39.321  137.178 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 369.9091    33.1439  11.161 9.26e-11 ***
## WEIGHT       -2.9869     0.7851  -3.805 0.000913 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 68.84 on 23 degrees of freedom
## Multiple R-squared:  0.3863, Adjusted R-squared:  0.3596 
## F-statistic: 14.48 on 1 and 23 DF,  p-value: 0.0009126

Discussion

We can see that the coefficients of height and weight are both significant (p < 0.05). Both regressions also have a significant ANOVA (F) test, so each model significantly explains variation in RV (H0: beta(…) = 0).

Multivariable regression

    Next, we fit the model with both WEIGHT and HEIGHT to predict RV.
  1. Fit this model, and summarize the results. Try to explain why the individual coefficients are not significant, in contrast to the overall model.

  2. What is, based on the model in a, the expected RV value for an individual with height 150 cm and weight 45 kg?

  3. To check the model assumptions, one could look at the residuals plotted against the predictor variables. Create these two plots. With some imagination, you can discover a curvature in one of the plots. Therefore, adding the square of this predictor could improve the model.

Hints

###################################################
#  part a                                         #
###################################################
lm(... ~ ... + ...)  # this function should look familiar

###################################################
#  part b                                         #
###################################################
case <- c(..., ...)     # specify a vector with the height and weight
itcpt + betas %*% case  # the RV equals the intercept plus the
                        # matrix product of the betas and the
                        # case (%*% operator)

###################################################
#  part c                                         #
###################################################
fit$residuals          # to extract the residuals
plot(y = ..., x = ...) # plot these against...?

Code

###################################################
#  part a                                         #
###################################################
fit <- lm(RV ~ WEIGHT + HEIGHT, data = df)
summary(fit)

###################################################
#  part b                                         #
###################################################
betas <- coef(fit)[2:3]
itcpt <- coef(fit)[1]
case <- c(WEIGHT = 45, HEIGHT = 150)
itcpt + betas%*%case

###################################################
#  part c                                         #
###################################################
res <- fit$residuals     
plot(y = res, x = df$HEIGHT)
plot(y = res, x = df$WEIGHT)

Output

## 
## Call:
## lm(formula = RV ~ WEIGHT + HEIGHT, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -103.19  -58.34   -2.34   38.61  136.48 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 362.09272  191.90423   1.887   0.0724 .
## WEIGHT       -3.06526    2.05664  -1.490   0.1503  
## HEIGHT        0.07085    1.71209   0.041   0.9674  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 70.38 on 22 degrees of freedom
## Multiple R-squared:  0.3863, Adjusted R-squared:  0.3305 
## F-statistic: 6.925 on 2 and 22 DF,  p-value: 0.004649
##          [,1]
## [1,] 234.7831

Discussion

We can see that both Wald tests of the predictors are non-significant (p > 0.05). This does not make the model non-significant: an F of 6.925 on 2 and 22 degrees of freedom gives p = 0.005, so the model is significant. The reason the Wald tests are non-significant is that the predictors are highly collinear (high correlation between height and weight). The presented case has an expected RV of 234.8. Furthermore, the plots show no trend between the residuals and WEIGHT, but there is a trend between the residuals and HEIGHT.
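Instead of multiplying the coefficients by hand, the prediction in part b can be obtained with predict(). A sketch, assuming the fit object from the code above:

```r
new.case <- data.frame(HEIGHT = 150, WEIGHT = 45)
predict(fit, newdata = new.case)                           # expected RV: 234.8, as above
predict(fit, newdata = new.case, interval = "prediction")  # with 95% prediction limits
```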

Improving the model

    Next, we try to improve the model by adding a quadratic term.
  1. As discussed in the previous question, add a squared version of the predictor of the model and summarize the findings. Does it have a significant contribution?

  2. Calculate the joint significance of the variable, i.e. of the linear and the squared term together.

Hints

###################################################
#  part a                                         #
###################################################
lm(... ~ ... + ... + I(... ^  2))   # this is how to put a squared version in the formula

###################################################
#  part b                                         #
###################################################
anova(..., ..., test = "LRT")     # which models should be compared...?

Code

###################################################
#  part a                                         #
###################################################
fit2 <- lm(RV ~ WEIGHT + HEIGHT + I(HEIGHT ^ 2), data = df)
summary(fit2)

###################################################
#  part b                                         #
###################################################
anova(fit2, fit.weight, test ="LRT")

Output

## 
## Call:
## lm(formula = RV ~ WEIGHT + HEIGHT + I(HEIGHT^2), data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -100.439  -53.586   -6.107   39.339  138.612 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 669.59734  756.08388   0.886    0.386
## WEIGHT       -3.61590    2.47071  -1.464    0.158
## HEIGHT       -4.38336   10.72203  -0.409    0.687
## I(HEIGHT^2)   0.01657    0.03936   0.421    0.678
## 
## Residual standard error: 71.73 on 21 degrees of freedom
## Multiple R-squared:  0.3915, Adjusted R-squared:  0.3045 
## F-statistic: 4.503 on 3 and 21 DF,  p-value: 0.01369
## Analysis of Variance Table
## 
## Model 1: RV ~ WEIGHT + HEIGHT + I(HEIGHT^2)
## Model 2: RV ~ WEIGHT
##   Res.Df    RSS Df Sum of Sq Pr(>Chi)
## 1     21 108062                      
## 2     23 108983 -2    -920.7   0.9144

Discussion

The answer to a is that the squared version of height does not contribute significantly to the model: the Wald test gives p = 0.678. When testing the joint significance of height and height squared, the likelihood ratio test gives p = 0.914. The H0 of this test is that the model without height and height squared is as predictive for RV as the model with them. This hypothesis cannot be rejected.
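For linear models, the joint test in part b can also be carried out as an F-test, the conventional choice when comparing nested lm() fits. A sketch, assuming fit2 and fit.weight from the code above:

```r
anova(fit.weight, fit2)  # default F-test: RV ~ WEIGHT versus the model with HEIGHT terms
```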

Exercise 19: Covariance analysis I (2 groups)

This exercise again concerns the data of the 25 cystic fibrosis patients given in the file CYSTFIBR.SAV. The problem is about the difference in mean FEV1 between male and female patients, adjusted for body weight by covariance analysis.

Crude difference

    The first step is to estimate the crude difference in mean FEV1 between male and female patients. We will use regression to do this.
  1. If not already done, start RStudio. Open the file CYSTFIBR.SAV. Since this is a .sav file, we need to import this using a function from the foreign package: read.spss().

  2. Fit the simple regression models of FEV1 based on SEX. Examine the estimated regression coefficients. What is the interpretation of the coefficients in this case? Give the 95% CI and p-value.

  3. Check that the results obtained in b are in accordance with a t-test.

Hints

###################################################
#  part a                                         #
###################################################
install.packages("foreign")  # install package first time
library(foreign)
read.spss(..., to.data.frame = TRUE) 

###################################################
#  part b                                         #
###################################################
fit.crude <- lm(...)         
summary(...)             
confint(...)              # to obtain the confidence intervals of the coefficients

###################################################
#  part c                                         #
###################################################
t.test(formula = ..., data = df)  # what formula to put in...?

Code

###################################################
#  part a                                         #
###################################################
install.packages("foreign")
library(foreign)
df <- read.spss("./Data/CYSTFIBR.SAV", to.data.frame = TRUE)

###################################################
#  part b                                         #
###################################################
fit.crude <- lm(FEV1 ~ SEX, data = df)
summary(fit.crude)
confint(fit.crude)

###################################################
#  part c                                         #
###################################################
t.test(FEV1 ~ SEX, var.equal = TRUE, data = df)

Output

## 
## Call:
## lm(formula = FEV1 ~ SEX, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.8571  -6.8571  -0.8571   5.1429  17.1429 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   39.857      2.596  15.356  1.4e-13 ***
## SEXfemale    -11.675      3.913  -2.984  0.00664 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.712 on 23 degrees of freedom
## Multiple R-squared:  0.2791, Adjusted R-squared:  0.2477 
## F-statistic: 8.903 on 1 and 23 DF,  p-value: 0.006639
##                 2.5 %    97.5 %
## (Intercept)  34.48775 45.226540
## SEXfemale   -19.77000 -3.580653
## 
##  Two Sample t-test
## 
## data:  FEV1 by SEX
## t = 2.9837, df = 23, p-value = 0.006639
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   3.580653 19.769996
## sample estimates:
##   mean in group male mean in group female 
##             39.85714             28.18182

Discussion

There are two betas in the linear model: the intercept and the effect of SEX on FEV1. The second coefficient is the one of interest. It can be interpreted as: if SEX = female, FEV1 is 11.675 lower than if SEX = male. This difference is significant (p = 0.0066). The 95% confidence interval of the difference is -19.77 to -3.58; it excludes zero, consistent with the significant difference. If we check this with a t-test, we see that the p-value and confidence interval are equal to the ones obtained in the regression. Note that by default the function t.test() assumes unequal variances of FEV1 within the groups of SEX, whereas the regression assumes equal variances; we would therefore obtain a slightly different answer if we did not specify var.equal = TRUE.
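The equivalence between a simple regression on a binary factor and a pooled-variance t-test holds for any two-group data set. A minimal self-contained sketch on simulated data (group and variable names are illustrative, not the CYSTFIBR variables):

```r
# Simulated two-group data: regression on a binary factor vs. pooled t-test
set.seed(1)
toy <- data.frame(
  group = factor(rep(c("male", "female"), each = 15),
                 levels = c("male", "female")),
  y = c(rnorm(15, mean = 40, sd = 10), rnorm(15, mean = 28, sd = 10))
)

fit <- lm(y ~ group, data = toy)
tst <- t.test(y ~ group, var.equal = TRUE, data = toy)

# The slope's p-value equals the pooled-variance t-test's p-value
all.equal(summary(fit)$coefficients["groupfemale", "Pr(>|t|)"], tst$p.value)
```

The slope estimate itself equals the difference in group means (with a sign determined by which level is the reference).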

Analysis of covariance

    Next, we will analyze the difference between males and females in FEV1 by analysis of covariance, focusing on WEIGHT.
  1. Look at the difference in body weight between males and females by carrying out a t-test. The difference turns out to be not significant. May we conclude from this that it does not make sense to adjust the difference in mean FEV1 between males and females for weight?

  2. Adjust for weight by adding WEIGHT as a predictor variable to the crude model fitted above. Give the estimated mean difference in FEV1 between male and female patients, adjusted for weight. Give also the 95% confidence interval.

Hints

###################################################
#  part b                                         #
###################################################
lm(... ~ ... + ..., data = df)         

Code

###################################################
#  part a                                         #
###################################################
t.test(WEIGHT ~ SEX, data = df)

###################################################
#  part b                                         #
###################################################
fit.adj <- lm(FEV1 ~ SEX + WEIGHT, data = df)
summary(fit.adj)
confint(fit.adj)

Output

## 
##  Welch Two Sample t-test
## 
## data:  WEIGHT by SEX
## t = 0.9771, df = 22.451, p-value = 0.3389
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7.535395 20.991239
## sample estimates:
##   mean in group male mean in group female 
##             41.36429             34.63636
## 
## Call:
## lm(formula = FEV1 ~ SEX + WEIGHT, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.181  -5.849  -1.468   5.330  16.986 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  30.5064     4.9609   6.149 3.44e-06 ***
## SEXfemale   -10.1544     3.7028  -2.742   0.0119 *  
## WEIGHT        0.2261     0.1048   2.157   0.0422 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.022 on 22 degrees of freedom
## Multiple R-squared:  0.4049, Adjusted R-squared:  0.3508 
## F-statistic: 7.484 on 2 and 22 DF,  p-value: 0.003316
##                     2.5 %     97.5 %
## (Intercept)  20.218181138 40.7946241
## SEXfemale   -17.833608041 -2.4752360
## WEIGHT        0.008691923  0.4434247

Discussion

It is observed that adding weight to the regression model results in a significant effect of WEIGHT on FEV1. The estimated difference in FEV1 between males and females, adjusted for weight, is 10.15 (95% CI: 2.48 - 17.83). This is slightly smaller than the unadjusted difference (11.675; 95% CI: 3.58 - 19.77).
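One way to see what "adjusted for weight" means is to predict FEV1 for a male and a female patient at the same weight; the predicted difference then reproduces the SEXfemale coefficient. A sketch, assuming df and fit.adj as defined above (the weight of 40 kg is an arbitrary choice; any common value gives the same difference):

```r
# Predict FEV1 for both sexes at an identical weight of 40 kg
newdat <- data.frame(SEX = factor(c("male", "female"), levels = levels(df$SEX)),
                     WEIGHT = c(40, 40))
pred <- predict(fit.adj, newdata = newdat)
diff(pred)  # female minus male: equals coef(fit.adj)["SEXfemale"], about -10.15
```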

Interaction

    In the previous model, it is assumed that the regression lines of FEV1 on WEIGHT are parallel for males and females. In this part we will explore whether that assumption is justified.
  1. To assess whether the assumption of parallel regression lines in the model used in d. is justified, we first fit the regression of FEV1 on WEIGHT for each sex separately. Plot both lines over the scatter plot. Is there much difference in slope between the two lines?

  2. To test the assumption of parallel regression, add an interaction term to the model. Is the assumption of no interaction reasonable?

Hints

###################################################
#  part a                                         #
###################################################
fit.fem <- lm(... ~ ..., data = df[df$SEX == "female", ])
fit.men <- lm(... ~ ..., data = df[df$SEX == "male", ])    
plot(...)
abline(...)

###################################################
#  part b                                         #
###################################################
lm(... ~ ... * ...)    # "*" includes both main effects and their interaction

Code

###################################################
#  part a                                         #
###################################################
fit.fem <- lm(FEV1 ~ WEIGHT, data = df[df$SEX == "female", ])
fit.men <- lm(FEV1 ~ WEIGHT, data = df[df$SEX == "male", ])
plot(df$FEV1 ~ df$WEIGHT)
abline(fit.fem, col = "red")
abline(fit.men, col = "blue")

###################################################
#  part b                                         #
###################################################
fit.int <- lm(FEV1 ~ WEIGHT * SEX, data = df)
summary(fit.int)

Output

## 
## Call:
## lm(formula = FEV1 ~ WEIGHT * SEX, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.8606  -6.1095  -0.9119   4.7378  17.0240 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      29.69290    5.67831   5.229  3.5e-05 ***
## WEIGHT            0.24573    0.12370   1.986   0.0602 .  
## SEXfemale        -7.31330    9.72627  -0.752   0.4604    
## WEIGHT:SEXfemale -0.07821    0.24668  -0.317   0.7543    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.212 on 21 degrees of freedom
## Multiple R-squared:  0.4077, Adjusted R-squared:  0.3231 
## F-statistic: 4.819 on 3 and 21 DF,  p-value: 0.01047

Discussion

We can see that the lines in the plot diverge slightly; however, within the weight range of this sample (roughly 20 to 70 kg) they do not cross. The interaction term in the model (WEIGHT:SEXfemale) is not significant (p = 0.754), so we conclude that the regression lines may be assumed parallel.
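Fitting the two sexes separately and fitting one interaction model describe the same pair of lines: the reference-group slope is the WEIGHT coefficient, and the other group's slope is WEIGHT plus the interaction term. A self-contained sketch on simulated data (toy names, not the CYSTFIBR variables):

```r
# Toy data: per-group slopes vs. coefficients of the full interaction model
set.seed(2)
toy <- data.frame(
  sex = factor(rep(c("male", "female"), each = 20),
               levels = c("male", "female")),
  wt  = runif(40, 20, 70)
)
toy$fev <- 30 + 0.25 * toy$wt - 8 * (toy$sex == "female") + rnorm(40, sd = 9)

fit.m   <- lm(fev ~ wt, data = toy[toy$sex == "male", ])
fit.f   <- lm(fev ~ wt, data = toy[toy$sex == "female", ])
fit.all <- lm(fev ~ wt * sex, data = toy)

# Male (reference) slope equals the wt coefficient of the interaction model
all.equal(unname(coef(fit.m)["wt"]), unname(coef(fit.all)["wt"]))
# Female slope equals wt plus the interaction coefficient
all.equal(unname(coef(fit.f)["wt"]),
          unname(coef(fit.all)["wt"] + coef(fit.all)["wt:sexfemale"]))
```

The standard errors differ slightly (the interaction model pools the residual variance over both groups), but the fitted lines are identical.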

Exercise 20: Covariance analysis II (3 groups)

This exercise concerns the data set BLDPRES.SAV. We want to investigate whether there are differences in mean systolic blood pressure between the three social-economic status groups, adjusted for differences in body weight.

Crude regression

    The first step is to estimate the crude differences in mean systolic blood pressure between the social-economic status groups. We will use regression to do this.
  1. If not already done, start RStudio. Open the file BLDPRES. Since this is a .sav file, we need to import this using a function from the foreign package: read.spss().

  2. Check whether ses is coded as a factor, and if so, what the coding of the variable is. How should the coefficients of this variable be interpreted?

  3. Fit the multiple regression model with sys as dependent and ses as independent variable. What is the global conclusion concerning the differences in systolic blood pressure? Give the (unadjusted) differences, together with a confidence interval and the P value, between the middle and the lower class and between the upper and the lower class.

  4. Check that the one-way ANOVA gives the same p-value.

Hints

###################################################
#  part a                                         #
###################################################
install.packages("foreign")  #install package first time
library(foreign)
read.spss(..., to.data.frame = TRUE) 

###################################################
#  part b                                         #
###################################################
class(...)         #to check the class
factor(...)        #to recode the variable in a factor
contrasts(...)     #to obtain the coding for the regression

###################################################
#  part d                                         #
###################################################
summary(aov(...))  #to run a one-way anova

Code

###################################################
#  part a                                         #
###################################################
install.packages("foreign")  #install package first time
library(foreign)
df <- read.spss("./Data/BLDPRES.SAV", to.data.frame = TRUE)

###################################################
#  part b                                         #
###################################################
class(df$ses)
contrasts(df$ses)

###################################################
#  part c                                         #
###################################################
fit.crude <- lm(sys ~ ses, data = df)
summary(fit.crude)
confint(fit.crude)

###################################################
#  part d                                         #
###################################################
summary(aov(df$sys ~ df$ses))

Output

## [1] "factor"
##                middle class upperclass
## lower class               0          0
##  middle class             1          0
## upperclass                0          1
## 
## Call:
## lm(formula = sys ~ ses, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.736 -11.685  -2.736   7.264  62.778 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       124.185      1.453  85.470  < 2e-16 ***
## ses middle class   -1.449      2.099  -0.690  0.49058    
## sesupperclass       8.037      2.595   3.097  0.00213 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.07 on 325 degrees of freedom
## Multiple R-squared:  0.04072,    Adjusted R-squared:  0.03482 
## F-statistic: 6.898 on 2 and 325 DF,  p-value: 0.001164
##                       2.5 %    97.5 %
## (Intercept)      121.326369 127.04320
## ses middle class  -5.577575   2.68045
## sesupperclass      2.931768  13.14311
##              Df Sum Sq Mean Sq F value  Pr(>F)   
## df$ses        2   4019  2009.7   6.898 0.00116 **
## Residuals   325  94684   291.3                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Discussion

The analysis of variance (ANOVA) shows a p-value of 0.00116, exactly the same as for the linear model (note: compared to the overall F-test, not to the individual coefficients). We see a significant difference between the three ses groups. From the regression we can observe that the upperclass has a higher systolic blood pressure than the other two groups, which do not differ significantly from each other.
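The correspondence can also be read directly from the regression object: the overall F-statistic reported by summary() is exactly the ANOVA F-test. A sketch, assuming fit.crude as fitted above:

```r
# Extract the overall F-statistic and recompute its p-value by hand
fstat <- summary(fit.crude)$fstatistic   # value = 6.898, numdf = 2, dendf = 325
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)  # about 0.00116
```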

Adjusted regression

    Next, we adjust for weight, because we believe it to be a confounder.
  1. Fit a multiple regression model, with sys as dependent, and ses and weight as independent variables.

  2. Check whether ses still contributes significantly to the model (note: which models should be compared?).

  3. The model fitted in a. assumes that the differences in mean systolic blood pressure do not depend on weight. Check that assumption by adding the appropriate interaction terms. What is your conclusion?

Hints

###################################################
#  part c                                         #
###################################################
lm(... ~ ... * ...)

Code

###################################################
#  part a                                         #
###################################################
fit.adj <- lm(sys ~ ses + weight, data = df)

###################################################
#  part b                                         #
###################################################
fit.weight <- lm(sys ~ weight, data = df)
anova(fit.adj, fit.weight)

###################################################
#  part c                                         #
###################################################
fit.int <- lm(sys ~ ses * weight, data = df)
summary(fit.int)

Output

## 
## Call:
## lm(formula = sys ~ ses + weight, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.005 -10.230  -3.373   6.680  61.855 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      112.21601    4.35871  25.745  < 2e-16 ***
## ses middle class  -3.18392    2.15923  -1.475  0.14130    
## sesupperclass      4.58305    2.82755   1.621  0.10602    
## weight             0.25945    0.08921   2.908  0.00388 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.88 on 324 degrees of freedom
## Multiple R-squared:  0.06513,    Adjusted R-squared:  0.05647 
## F-statistic: 7.524 on 3 and 324 DF,  p-value: 6.995e-05
## Analysis of Variance Table
## 
## Model 1: sys ~ ses + weight
## Model 2: sys ~ weight
##   Res.Df   RSS Df Sum of Sq Pr(>Chi)  
## 1    324 92275                        
## 2    326 94776 -2   -2501.5  0.01238 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call:
## lm(formula = sys ~ ses * weight, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.839 -10.110  -3.473   6.841  62.100 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             113.70175    8.00943  14.196   <2e-16 ***
## ses middle class         -7.78273   10.71896  -0.726    0.468    
## sesupperclass             7.18875   13.23063   0.543    0.587    
## weight                    0.22725    0.17080   1.331    0.184    
## ses middle class:weight   0.09115    0.21576   0.422    0.673    
## sesupperclass:weight     -0.03662    0.24345  -0.150    0.881    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.92 on 322 degrees of freedom
## Multiple R-squared:  0.06627,    Adjusted R-squared:  0.05177 
## F-statistic: 4.571 on 5 and 322 DF,  p-value: 0.0004837

Discussion

It is observed that, when correcting for weight, the ses classes no longer differ significantly from each other (p > 0.05 for both coefficients). However, comparing the model with ses against the model without ses gives p = 0.012, so ses still contributes significantly to the multivariable model, even though the pairwise group differences are not significant. Furthermore, it is safe to assume that the difference in systolic blood pressure does not depend on weight, because the interaction terms between ses and weight (ses middle class:weight and sesupperclass:weight) are not significant (p = 0.67 and p = 0.88, respectively).
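The same 2-df test for ses can be obtained without fitting the reduced model explicitly, by using drop1() on the adjusted model (assuming fit.adj as fitted above):

```r
# drop1() removes each term in turn and tests it against the full model;
# the ses line reproduces the anova(fit.adj, fit.weight) comparison
drop1(fit.adj, test = "F")
```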

Exercise 21: Covariance analysis

Now (again for the BLDPRES data) we want to investigate whether there are differences in mean diastolic blood pressure between the three social-economic status groups, adjusted for differences in pulse rate.

Linear regression

    First, we fit a linear model to investigate the association between diastolic blood pressure, SES class and pulse rate.
  1. Open the BLDPRES.sav file with the read.spss function. Check whether ses is coded as a factor, and if so what the coding is of the variable.

  2. Fit a linear model with diastolic blood pressure as dependent variable, and SES class and pulse rate as independent variables. Summarize your conclusions.

  3. Investigate whether there is interaction between SES class and the pulse rate. Also try to plot the regression lines of the different SES classes with the ggplot function.

  4. Give the difference in the mean diastolic blood pressure between the middle and the low class as a function of the pulse rate. Do the same for the difference between the high and the low class.

Hints

###################################################
#  part c                                         #
###################################################

# fit a new model with interaction between SES class and pulse

# then plot the regression lines for the different SES classes
ggplot(df, aes(y = ..., x = ..., color = ...)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)

###################################################
#  part d                                         #
###################################################

# check the model summary again to analyze the regression equation
summary(...)

Code

###################################################
#  part a                                         #
###################################################
library(foreign)
df <- read.spss("./data/BLDPRES.sav", to.data.frame = TRUE)

class(df$ses)
contrasts(df$ses)

###################################################
#  part b                                         #
###################################################
fit1 <- lm(dias ~ ses + pulse, data = df)
summary(fit1)  #coefficients
# beta mid SES: 0.033, P = 0.979 (reference category: low SES)
# beta high SES: 3.322, P = 0.031 (reference category: low SES)
# beta pulse: 0.107, P = 0.026

# note that a model including dummies for SES class will give the same results 
levels(df$ses)  # please note: there is a space before " middle class"
mid.ses <- ifelse(df$ses == " middle class", 1, 0) 
high.ses <- ifelse(df$ses == "upperclass", 1, 0)

fit2 <- lm(dias ~ mid.ses + high.ses + pulse, data = df)
summary(fit2)

###################################################
#  part c                                         #
###################################################
fit3 <- lm(dias ~ pulse * ses, data = df)
summary(fit3)

library(ggplot2)
ggplot(df, aes(y = dias, x = pulse, color = ses)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)

###################################################
#  part d                                         #
###################################################
summary(fit3)

# regression equation:
# dias = 74.56 + 0.07 * pulse + 6.68 * ses middle class - 30.47 * sesupperclass - 0.08 * pulse * ses middle class + 0.41 * pulse * sesupperclass

# predictions for low SES:     dias = 74.56 + 0.07 * pulse
# predictions for middle SES:  dias = 74.56 + 0.07 * pulse + 6.68 - 0.08 * pulse
#                             ---------------------------------------------------
# difference:                                                6.68 - 0.08 * pulse

# predictions for low SES:     dias = 74.56 + 0.07 * pulse
# predictions for high SES:    dias = 74.56 + 0.07 * pulse - 30.47 + 0.41 * pulse
#                             ---------------------------------------------------
# difference:                                                - 30.47 + 0.41*pulse

Output

## re-encoding from CP1252
## [1] "factor"
##                middle class upperclass
## lower class               0          0
##  middle class             1          0
## upperclass                0          1
## 
## Call:
## lm(formula = dias ~ ses + pulse, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.1714  -6.8993  -0.2905   6.0025  29.3286 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      71.71777    4.00409  17.911   <2e-16 ***
## ses middle class  0.03281    1.23855   0.026   0.9789    
## sesupperclass     3.32237    1.53131   2.170   0.0308 *  
## pulse             0.10659    0.04753   2.243   0.0256 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.07 on 324 degrees of freedom
## Multiple R-squared:  0.03257,    Adjusted R-squared:  0.02361 
## F-statistic: 3.635 on 3 and 324 DF,  p-value: 0.0132
## 
## Call:
## lm(formula = dias ~ pulse * ses, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.1123  -7.4857  -0.4016   5.3243  29.3877 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             74.56009    6.47520  11.515  < 2e-16 ***
## pulse                    0.07205    0.07802   0.923  0.35645    
## ses middle class         6.68121    8.59382   0.777  0.43747    
## sesupperclass          -30.47143   11.29895  -2.697  0.00737 ** 
## pulse:ses middle class  -0.08181    0.10378  -0.788  0.43107    
## pulse:sesupperclass      0.40738    0.13529   3.011  0.00281 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.88 on 322 degrees of freedom
## Multiple R-squared:  0.07416,    Adjusted R-squared:  0.05978 
## F-statistic: 5.158 on 5 and 322 DF,  p-value: 0.000144

Discussion

In the first model, there was a significant effect of high SES and of pulse on diastolic blood pressure, but no significant effect of middle SES. The interaction term between high SES and pulse (in the second model) was significant, so the effect of high SES differs with pulse rate.
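The pulse-dependent differences derived in part d can also be evaluated numerically from the fitted coefficients. A sketch, assuming fit3 as fitted above (pulse rates 60 and 90 are arbitrary illustrative values):

```r
b <- coef(fit3)
pulse <- c(60, 90)
# Middle vs low SES: 6.68 - 0.08 * pulse
b["ses middle class"] + b["pulse:ses middle class"] * pulse
# High vs low SES: -30.47 + 0.41 * pulse
b["sesupperclass"] + b["pulse:sesupperclass"] * pulse
```

Note that the coefficient names contain the leading space of the " middle class" factor level.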

ANOVA

    We could also fit this model with ANOVA. Do this using the aov function and compare the output with the results from the linear regression model.

      Hints

      aov(... ~ ..., data = ...)
      # aov() fits the variables sequentially (i.e. testing each term as it is added, in the order specified)
      # instead, we want to test the variables marginally (i.e. testing each term as if it were the last one added)
      # use the drop1 function to correct this
      drop1(..., ~., test = "F")   
      
      # also take a look at the diagnostic plots
      layout(matrix(c(1, 2, 3, 4), 2, 2)) # create a 2x2 layout for the plots
      plot(...) 

      Code

      fit4 <- aov(dias ~ pulse * ses, data = df)
      drop1(fit4, ~., test = "F")   
      coefficients(fit4)
      # similar results to the previous fit3 model
      
      # also take a look at the diagnostic plots
      layout(matrix(c(1, 2, 3, 4), 2, 2)) # create a 2x2 layout for the plots
      plot(fit4) 

      Output

      ## Single term deletions
      ## 
      ## Model:
      ## dias ~ pulse * ses
      ##           Df Sum of Sq   RSS    AIC F value    Pr(>F)    
      ## <none>                 31433 1508.5                      
      ## pulse      1     83.25 31516 1507.4  0.8528 0.3564499    
      ## ses        2   1162.47 32596 1516.4  5.9542 0.0028894 ** 
      ## pulse:ses  2   1412.16 32845 1519.0  7.2331 0.0008457 ***
      ## ---
      ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      ##            (Intercept)                  pulse       ses middle class 
      ##            74.56008611             0.07205073             6.68121194 
      ##          sesupperclass pulse:ses middle class    pulse:sesupperclass 
      ##           -30.47143064            -0.08181469             0.40737847

      Discussion

      The results from the ANOVA model (with marginal testing of each term via drop1) are consistent with those from the linear regression model. The residual plot shows no evident non-linearity, the scale-location plot shows no clear signs of unequal variance, the normal Q-Q plot shows no violation of the normality assumption, and the residuals-vs-leverage plot shows no influential outliers.
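      To see why drop1() is needed, note that the sequential tests from summary(aov(...)) can change when the order of terms changes, while the marginal drop1() tests do not. A quick check, assuming df as loaded above:

      ```r
      # Reversing the term order can change the sequential sums of squares...
      summary(aov(dias ~ ses * pulse, data = df))
      # ...but the marginal drop1() tests are unchanged
      drop1(aov(dias ~ ses * pulse, data = df), ~., test = "F")
      ```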